Given a set of variables and the correlations among them, we develop a method for 
finding clustering among the variables. The method takes advantage of information 
implicit in higher-order (not just pairwise) correlations. The idea is to define a 
Potts model whose energy is based on the correlations. Each state of this model 
is a partition of the variables and a Monte Carlo method is used to identify states 
of lowest energy, those most consistent with the correlations. A set of the 100 or 
so lowest such partitions is then used to construct a stochastic dynamics (using 
the adjacency matrix of each partition) whose observable representation gives the 
clustering. Three examples are studied. For two of them the 3rd-order correlations 
are significant for getting the clusters right. The last of these is a toy model of a 
biological system in which the joint action of several genes or proteins is necessary 
to accomplish a given process. 

PACS numbers: 87.10.Mn, 89.75.Hc, 89.75.Fb, 05.50.+q



Given a collection of objects that bear a relation to one another, it is often desirable to 
organize them into clusters or communities; moreover, this organization may take the form 
of a graph whose features not only allow the viewer a global image of a complex situation, 
but which may also have quantitative distance relations between and within the disparate 
communities. The applications are legion, and I mention several recent articles where this 
problem is discussed [1-9]. 

There are also many kinds of relations among the "objects," from flow (of probability, of 
substances, of attention), to collaboration, to correlation. In this article I focus on what can 
be learned from correlations [10], and especially from third- and higher-order correlation 
functions. As far as I know, no previous method has exploited this information. Such 
correlations may be expected to be particularly important in biological applications, where, 
for example, it often happens that it takes the concerted action of many genes or proteins 
to accomplish a given biological function. 

The method makes use of previous work on the "observable representation" (OR) [11, 12, 13], 
but also introduces a Potts model calculation (see also [1]). The principle of the 
method is clear enough, but the details can become complicated. We suppose that there 
are N variables, $X_1, \dots, X_N$, and that they exhibit a number of correlations. Let each 
take M values. As the measure of correlations we use normalized cumulants. Thus let 
$Y_k = X_k - \langle X_k \rangle$ and $Z_k = Y_k / \sqrt{\sum Y_k^2 / (M-1)}$, $k = 1, \dots, N$. The quantities that will 
interest us are $J_{k\ell} = \langle Z_k Z_\ell \rangle$, $K_{k\ell m} = \langle Z_k Z_\ell Z_m \rangle$, etc. (for higher cumulants, additional terms 
will need to be subtracted from the expectation values). 
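As an illustration of these definitions, the following is a minimal sketch in Python with NumPy. It assumes the data are stored as an M-by-N array (M sampled values of each of N variables) and that the expectation values are plain averages over the M samples; the function name `normalized_cumulants` and the array layout are choices made for this sketch, not part of the method as stated.

```python
import numpy as np

def normalized_cumulants(X):
    """Return (J, K) for data X of shape (M, N): M samples of N variables.

    J[k, l]    = <Z_k Z_l>
    K[k, l, m] = <Z_k Z_l Z_m>
    where Z_k is X_k centered and scaled to unit sample variance.
    """
    X = np.asarray(X, dtype=float)
    M = X.shape[0]
    Y = X - X.mean(axis=0)                           # Y_k = X_k - <X_k>
    Z = Y / np.sqrt((Y ** 2).sum(axis=0) / (M - 1))  # divide by sample std
    # Assumption: <.> is the average over the M samples.
    J = np.einsum("tk,tl->kl", Z, Z) / M
    K = np.einsum("tk,tl,tm->klm", Z, Z, Z) / M
    return J, K

# Example: 500 samples of 4 independent Gaussian variables.
rng = np.random.default_rng(0)
J, K = normalized_cumulants(rng.normal(size=(500, 4)))
```

With this normalization the diagonal of J equals (M - 1)/M, and both arrays are symmetric under any permutation of their indices. The K array has N^3 entries, so for large N one would store only the entries with k < l < m.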

The arrays J, etc., will be used as energy coefficients in a q-state Potts model. With 
appropriate sign, positively correlated variables will be in the same Potts state. I then 
use a stochastic dynamics based on this energy as a way of discovering the lowest energy 
configurations of this system. (Interestingly, the associated stochastic process has, up to 
a factor, just the original coefficients as its high temperature correlations [14].) Each such 
configuration is essentially a partition of the set of variables, {X}. A partition corresponds 
to an adjacency matrix and all such adjacency matrices are added with an appropriate 
Boltzmann weight. Finally, this N-by-N matrix is treated as the generator of another 
stochastic process and its OR depicted. 

To illustrate the method, I work through three examples. In the first, there are 4 independent 
time series from which are built several others, in combinations that I will indicate. 
For this example, higher order correlations are not significant, so this will serve 
as an introduction to the general OR method. To obtain a stochastic matrix, the correlation 
functions, $J_{xy}$, are exponentiated (element-by-element) with a fictitious temperature, 
$T$. Thus, $\tilde R_{xy} = \exp[J_{xy}/T]$ for $x \neq y$. Let $\mu = \max_y \{ \sum_{x \neq y} \tilde R_{xy} \}$ and let $R_{xy} = \tilde R_{xy}/\mu$ 
for $x \neq y$. Next, adjust the diagonal of $R$ so its column sums are unity [15]. This yields 
a stochastic matrix ($R_{xy}$ is the probability of a transition, $y \to x$). Call the eigenvalues of 
$R$, $1 = \lambda_0 \geq \lambda_1 \geq \dots$ (sometimes these are ordered by absolute value). The corresponding 
left and right eigenvectors are not (in general) equal, but since $J$ is symmetric, we have 
$p_\alpha(x) \propto A_\alpha(x) p_0(x)$, where the left eigenvectors are the $A$'s and the right eigenvectors the 
$p$'s. ($A_0 \equiv 1$ represents conservation of probability, and $p_0$ is the stationary state.) The 
temperature $T$ is adjusted rather loosely so as to bring $\lambda_1$ close to 1. In this example there 
are 11 variables. They are produced from 4 separate time series, with variables 1, 2 and 3 
based on the first, 4, 5, and 6 on the second, 7 and 8 on the third, 9 and 10 on the fourth, while 
#11 is mostly random, plus a small piece proportional to the sum of the generators of group 
1 and group 2. The OR in Fig. 1 is a 3-dimensional plot of the points $[A_1(x), A_2(x), A_3(x)]$ 
for x in the space of 11 discrete variables. Note that the OR has grouped them properly. 
Since $\lambda_1$ is not extremely close to 1, some of the clusters are not at the extremal points. The 
11th point lies apart from the clusters, although had it had a smaller noise component it would 
have been on a line connecting group #1 and group #2. 
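The construction just described can be sketched as follows in Python with NumPy. This is an illustration under stated assumptions, not a definitive implementation: the normalizing factor is taken to be the largest off-diagonal column sum (so that every diagonal entry can be chosen non-negative), the function names are invented for this sketch, and the eigenvectors carry NumPy's unit-norm scaling rather than the convention in which the trivial left eigenvector is identically 1.

```python
import numpy as np

def stochastic_from_correlations(J, T):
    """Column-stochastic matrix R built from a symmetric correlation matrix J."""
    J = np.asarray(J, dtype=float)
    R = np.exp(J / T)                 # element-by-element exponential
    np.fill_diagonal(R, 0.0)          # only the x != y entries are kept
    mu = R.sum(axis=0).max()          # assumed: largest off-diagonal column sum
    R /= mu
    # Adjust the diagonal so that every column sums to one.
    R[np.diag_indices_from(R)] = 1.0 - R.sum(axis=0)
    return R

def observable_representation(R, dim=3):
    """Return the leading eigenvalues and the coordinates [A_1(x), ..., A_dim(x)]."""
    # Left eigenvectors of R are right eigenvectors of its transpose.
    vals, vecs = np.linalg.eig(R.T)
    order = np.argsort(-vals.real)    # descending algebraic eigenvalue
    vals, vecs = vals.real[order], vecs.real[:, order]
    return vals[: dim + 1], vecs[:, 1 : dim + 1]

# Toy input: two pairs of strongly correlated variables.
J = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.9, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.9],
              [0.0, 0.0, 0.9, 1.0]])
R = stochastic_from_correlations(J, T=0.3)
eigenvalues, coords = observable_representation(R, dim=1)
```

In this toy case the first non-trivial left eigenvector takes one sign on variables 1 and 2 and the opposite sign on variables 3 and 4, which is the grouping one would read off the plot. Lowering T pushes the second eigenvalue toward 1 and sharpens the separation.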

For the same set of variables, using both 2-point and 3-point correlations, Fig. 2 shows 
the result of a lengthier process involving two stochastic processes. For a q-state Potts model 
let the energy be given by $E = -\sum_{k,\ell} (J_{k\ell} - \bar J)\, \delta(s_k, s_\ell) - \sum_{k,\ell,m} K_{k\ell m}\, \delta(s_k, s_\ell, s_m)$, with $J_{k\ell}$ 
and $K_{k\ell m}$ the correlations (as defined above) and $\bar J$ a parameter, like $q$, to be selected based 
on considerations to be discussed in a moment. A Monte Carlo simulation is run for this 
system at a moderately low temperature, T. I have found that the most useful outcomes 
are obtained when, after an initial warmup stage, the system moves among the 100 or so 
lowest energy states, and T is selected so that the system settles into this number of states. (So 
far, missing sectors of state space, as could be expected in spin-glass-like structures, have 
not been in evidence; however, for some systems they could certainly appear.) These low 
energy states are degenerate under permutations of Potts-labels; however, each uniquely 
defines a partition of the original variables. The parameter $\bar J$ is selected for non-trivial 
partitioning; thus it would avoid favoring only purely ferromagnetic states in the case where 
all correlations are positive. Finally, q is selected large enough to give non-trivial partitions; 
taking it larger should not (and does not in practice, once a reasonable q is selected) change 
the partitions. 
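A minimal sketch of such a simulation, in plain Python, is given below. It is meant only to make the steps concrete and rests on several assumptions: the sums in the energy run over unordered pairs and triples of distinct variables, the offset parameter is called `J0`, single-site Metropolis updates are used, and the energy is recomputed in full after each proposal (simple but slow). All function names are invented for this sketch.

```python
import math
import random
from itertools import combinations

def potts_energy(s, J, K, J0):
    """Energy of Potts configuration s; sums over unordered pairs and triples."""
    n = len(s)
    e = 0.0
    for k, l in combinations(range(n), 2):
        if s[k] == s[l]:
            e -= J[k][l] - J0
    for k, l, m in combinations(range(n), 3):
        if s[k] == s[l] == s[m]:
            e -= K[k][l][m]
    return e

def canonical_partition(s):
    """Relabel states by order of first appearance, removing label degeneracy."""
    seen = {}
    return tuple(seen.setdefault(v, len(seen)) for v in s)

def metropolis(J, K, J0, q, T, sweeps, warmup, seed=0):
    """Count how often each partition is visited after the warmup sweeps."""
    rng = random.Random(seed)
    n = len(J)
    s = [rng.randrange(q) for _ in range(n)]
    e = potts_energy(s, J, K, J0)
    counts = {}
    for sweep in range(sweeps):
        for _ in range(n):
            k = rng.randrange(n)
            old = s[k]
            s[k] = rng.randrange(q)                 # propose a new Potts state
            e_new = potts_energy(s, J, K, J0)
            if e_new <= e or rng.random() < math.exp((e - e_new) / T):
                e = e_new                           # accept
            else:
                s[k] = old                          # reject
        if sweep >= warmup:
            p = canonical_partition(s)
            counts[p] = counts.get(p, 0) + 1
    return counts

# Toy input: variables (0, 1) and (2, 3) are correlated; no 3-point terms.
J = [[0.0, 0.9, 0.0, 0.0], [0.9, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.9], [0.0, 0.0, 0.9, 0.0]]
K = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
counts = metropolis(J, K, J0=0.2, q=4, T=0.1, sweeps=2000, warmup=500)
```

Here the positive offset `J0` penalizes putting uncorrelated variables in the same state, so the most frequently visited partition separates the two pairs rather than collapsing everything into one state.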

FIG. 1: Observable representation sorting of objects based on a fictitious dynamics in which 
the transition probabilities are (essentially) $\exp[J_{xy}/T]$, with $J$ the correlations. What is 
plotted is the triplet of left eigenvector values, $[A_1(x), A_2(x), A_3(x)]$, where the eigenvector labels 
correspond to descending algebraic value of the associated eigenvalue (the zeroth is the trivial 
$A_0(x) = 1$). Each little circle in the graph is one of the 11 points of the space, with its identity 
printed near it. The convex hull of this set is indicated in light shading. 

Note in particular that this process uses three-point correlations, and can use yet higher 
correlations in an obvious way. 

Once the significant partitions are identified, an adjacency matrix is constructed for each. 
These are N-by-N matrices (with N the number of original variables). Each partition has 
a frequency of occurrence in the simulation, and an N-by-N matrix is constructed 
by adding these adjacency matrices with that weight. Finally, this matrix is made into a 
stochastic matrix as in [15] or by dividing by column sums (the former method preserves the 
symmetry of the correlations). Fig. 2 is the OR for the first 3 (non-trivial) left eigenfunctions 
of that stochastic matrix. 
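The weighted sum of adjacency matrices can be sketched in plain Python as follows. The input format (a dictionary mapping each partition, written as a tuple of block labels, to its number of occurrences) and the function names are assumptions of this sketch, as is the choice to leave the diagonal of each adjacency matrix at zero. Only the simpler of the two normalizations, division by column sums, is shown.

```python
def weighted_adjacency(counts):
    """Frequency-weighted sum of the adjacency matrices of the partitions.

    counts maps a partition (tuple of block labels, one per variable)
    to the number of times it occurred in the simulation.
    """
    n = len(next(iter(counts)))
    total = sum(counts.values())
    W = [[0.0] * n for _ in range(n)]
    for part, c in counts.items():
        for x in range(n):
            for y in range(n):
                # x and y are adjacent when they share a block (diagonal omitted).
                if x != y and part[x] == part[y]:
                    W[x][y] += c / total
    return W

def column_normalize(W):
    """Divide each column by its sum; an isolated variable maps to itself."""
    n = len(W)
    sums = [sum(W[x][y] for x in range(n)) for y in range(n)]
    return [[W[x][y] / sums[y] if sums[y] > 0 else float(x == y)
             for y in range(n)] for x in range(n)]

# Two partitions of four variables, seen 3 times and 1 time respectively.
counts = {(0, 0, 1, 1): 3, (0, 0, 0, 1): 1}
W = weighted_adjacency(counts)
R = column_normalize(W)
```

Variables that share a block in every sampled partition get weight 1, and those that never do get weight 0. The resulting matrix R can then be passed to the same eigenvector analysis used for the direct exp(J/T) method.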

It is evident that both the direct use of exp(J/T) and the more complicated method 
using the Potts model have successfully divined the appropriate clustering of variables. I 
remark that in this example K is quite small. 

A second example focuses on a case where the 3-point correlations are much larger than 
the 2-point correlations. It is well known that for r.v.'s X and Y, independence implies 
$\langle XY \rangle = \langle X \rangle \langle Y \rangle$, but not vice versa. By a Monte Carlo method [16] I produced three triples 
of r.v.'s with 2-point correlations at the level of $10^{-4}$, and with 3-point correlations more 
than 100 times larger. As for Figs. 1 and 2, I show in Fig. 3 the resulting clustering. The nine 
r.v.'s were grouped as 1-2-3, 4-5-6, and 7-8-9 and the Potts model method clearly displays 
that relation. On the other hand, based on 2-point functions and the exp(J/T) method, no 
relation among these variables is noted at all. 
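That vanishing 2-point correlations can coexist with a large 3-point correlation is easy to check on a textbook construction, which is not the Monte Carlo procedure of [16] but a simpler stand-in chosen for this sketch: take two independent fair variables with values +1 and -1 and let the third be their product. The short Python check below enumerates the four equally likely outcomes exactly.

```python
from itertools import product

# The four equally likely outcomes of two fair +/-1 variables, with x3 = x1 * x2.
outcomes = [(a, b, a * b) for a, b in product((-1, 1), repeat=2)]

def mean(f):
    """Exact expectation of f over the equally likely outcomes."""
    return sum(f(o) for o in outcomes) / len(outcomes)

# Each variable has zero mean and unit variance, so these averages are
# already the normalized 2-point and 3-point correlations.
pair = {(i, j): mean(lambda o: o[i] * o[j])
        for i in range(3) for j in range(i + 1, 3)}
triple = mean(lambda o: o[0] * o[1] * o[2])
```

All three entries of `pair` are zero, while `triple` equals 1: any method that looks only at pairs sees three unrelated variables, whereas the 3-point function ties them together.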

A third situation where the OR/Potts model method can be useful is where there may 
be several "actors" and "processes." I have in mind a cartoon model of biological processes 
in which the actors (henceforth sans quotation marks) may be proteins or genes and the 
processes are reactions that the actors may (collectively or individually) excite or inhibit. As 
I will illustrate, this can pinpoint actors that tend to work together or in opposition and 
may also be useful in discerning the "processes." The information needed for this would be 
correlations (of 2nd and higher order), something that for gene expression is now available. 

FIG. 2: Observable representation for a fictitious dynamics based on a weighted sum of adjacency 
matrices produced by a second fictitious stochastic process based on a Potts model whose energy 
is built from the correlation functions (including 3-point) of a set of variables. The convex hull of 
this set is indicated in light shading. 

FIG. 3: Observable representation. In the first figure, using exp(J/T), only 2-point correlations 
relate the r.v.'s to one another. In the second, using the Potts model/OR method, the (correct) 
grouping (1,2,3), (4,5,6) and (7,8,9) is found. 

I will now give an example in which the usual correlations give no information and one 
is only able to discern an underlying grouping using 3rd-order correlations. I do not need 
to stress that this example is contrived to make exactly this point, and that for realistic 
situations one would get partial information at each level of correlations. 

Suppose we have 12 actors and 59 processes. Three of the processes are performed by 
having (only) actors [1,2,3,4], [5,6,7,8] and [9,10,11,12] participating. The other 56 processes 
are all those that involve only 2 actors, omitting those pairs that appear in the first 3 
processes. Thus I can write the process involving [1,2,3,4] as a vector: [1,1,1,1,0,0,0,0,0,0,0,0], 
with each column corresponding to an actor, and 1's indicating that the actor is involved in 
the process. With this notation, among the 56 other processes will be [1,0,0,0,1,0,0,0,0,0,0,0], 
involving #1 and #5. But [1,1,0,0,0,0,0,0,0,0,0,0] does not appear, since the pair [1,2] is 
included in one of the first three processes. For this pattern of activity, the two-point 
correlations are essentially constant and the only information arises from 3-point functions. 
See Fig. 4. 

FIG. 4: Observable representation. In the first figure only 2-point correlations have been used to 
relate the r.v.'s to one another. In the second, using the Potts model/OR method, the (correct) 
grouping (1,2,3,4), (5,6,7,8) and (9,10,11,12) is found. Whatever grouping may or may not be 
evident in the first figure is an artifact, since the eigenvalues are degenerate and the selection of 
which eigenvectors to plot is based on MATLAB's arbitrary choices. 

In conclusion I have shown how correlations, including (and especially) 3rd-order and 
higher, can be used to group variables. Calling the 2-point correlations J, a simple study 
of the matrix $\exp(J_{xy}/T)$ (with T a temperature-like parameter) can already give such 
information using the observable representation (OR) and by many other methods. However, 
to incorporate information available from higher order correlations, we go to an exponentially 
larger space on which we consider a random process based on a Potts model whose energy 
is related to all the correlations. For this process the OR is impractical (because of the size 
of the space), but the usual kind of Monte Carlo study can identify the lowest energy states. 
Using these we deduce the most significant partitions of the original variables and use the 
OR on the original space to deduce groupings. 